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ABSTRACT 

Accurate direct A"-body simulations help to obtain detailed information about the 
dynamical evolution of star clusters. They also enable comparisons with analytical 
models and Fokker-Planck or Monte-Carlo methods. NBODY6 is a well-known direct 
iV-body code for star clusters, and NBODY6-I—I- is the extended version designed for 
large particle number simulations by supercomputers. We present NBODY6-I--I-GPU, 
an optimized version of NBODY6-I--I- with hybrid parallelization methods (MPI, 
GPU, OpenMP, and AVX/SSE) to accelerate large direct Wbody simulations, and 
in particular to solve the million-body problem. We discuss the new features of the 
NBODY6-I--I-GPU code, benchmarks, as well as the first results from a simulation of a 
realistic globular cluster initially containing a million particles. For million-body sim¬ 
ulations, NBODY6-I--I-GPU is 400 — 2000 times faster than NBODY6 with 320 CPU 
cores and 32 NVIDIA K20X GPUs. With this computing cluster specification, the 
simulations of million-body globular clusters including 5 % primordial binaries require 
about an hour per half-mass crossing time. 

Key words: methods: numerical - globular clusters: general 


1 INTRODUCTION 

Direct simulations of star clusters have a long history. As 
algorithms and hardware have improved, larger numbers of 
stars could be simulated, allowing a more realistic represen¬ 
tation of t he dynamical evolution of globular star clusters. 
NBODY6 Ia arse thll2003l'l is a state-of-the-art direct A-body 
simulation code specifically designed for star clusters. It uses 
several algorithms to enhance the computing speed and ac¬ 
curacy, especially for strong interactions that arise from 
a large fraction of binaries and relatively short relaxation 


timescales (< 100 Myr for typical open clusters and < 1 Gyr 
for typical globular clusters) in collisional and dense stellar 
systems. Here, the terms “collisional” and “dense” are not 
well defined in the literature . The classical two-bo d y relax¬ 
ation time, as defined e.g. bv IChandrasekhaii (ll942l i: ISpitzeii 
lll98itl . describes how important distant gravitational two- 
body encounters are for the orbital motion of stars. If the 
relaxation time is very long, a system is denoted as “colli¬ 
sionless” (for example, galactic disks or bulges); the motion 
of stars is entirely determined by the smooth mean gravi¬ 
tational field of the system. If the relaxation time is short 
(e.g., shorter than the lifetime of the system) we denote the 
cluster as “collisional” (e.g., globular and open star clusters, 
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nuclear star clusters). If the stellar density is high enough, 
close two-body gravitational encounters and stellar collisions 
may occur. This aspect is crucial when studying “dense” star 
clusters. In dense and collisional star clusters a correct in¬ 
tegration of stellar motions requires pairwise gravitational 
interactions to be included between many if not all stars in 
the cluster. This is the situation for which codes such as 
NBODY6 are designed. 

Direct A^-body simulation of star clusters can be very 
time consuming. In a system with N particles, the full force 
calculation cost of one particle scales with 0{N). With in¬ 
dividual time steps for each particle, the cost per crossing 
time (ter) depends on the number of steps per particle (Ns) 


which varies with different time ste p criteria, integration 
met hods and star cluster prop erties. iMakino Sz Hud (Il988l i 
and iMakino fc AarsethI (Il992l i found that for the Hermite 
scheme with a time s tep criterion based on relative force 
change (lAarsethlll985ll . Ns is roughly proportional to 
for systems with homogeneous density. Thus, when using in¬ 
dividual time steps the total computational cost per crossing 
time of N particles scales with For systems with a 

power-law density distribution p oc r““. Ns depends on the 
power index a. Then the cost per crossing time scales with 
0(N'^^^) forQ < 24/1 1 and o(Ar( 6 -<»)/( 6 - 2 a)^ ^ > 24/11 

llMakino fc Hutl[l988l l. Considering the half- mass relaxation 
timescale, trh i s prop ortional to taN/In N dSnitzerl 1 19871 : 
ISugimoto et al.l 1 19901 '). the computational cost per is 
/ In N) for homogeneous systems and for power-law 
systems with a < 24/11 and 0{N<-^'^-^°‘^/<-^-'^°‘^/lnN) for 
a > 24/11. Thus, an efficient parallelization of a direct N- 
bod y code is necessa r y for l arge particle numbers. 

ISugimoto et al.l lll990l 'l discussed the fundamental 
problem that direct numerical simulations of globular 
star clusters could not be completed for decades if ex¬ 
trapolating the standard evolution of computational 
hardware (Moore’s law). They called for the construction 
of a special-purpose computer GRAPE, which finally 
was successfully initiated and c o mpleted by thei r team 
dMakino. Kokubo. fc TaiiH Il993l : IMakino fc Taiiil 1 19981 : 
IMakino et al. 2003l l . In the following years, graphical 
processing units ( GPU) widely replaced GRAPE (e.g., 
Harfst et al.l l20 07l: iHennebelle. Audit fc Mivill e-Deschened 


20071: 


20071: 


Portegies Zwart. Belleman fc Geldoi 


Befieman, Bedorf fc Portegies Zward 20081 : Schive et al.l 
20081 ) and much of the GRAPE software could b e ported to 


GPU i Gaburov. Harfst fc Portegies Zwartll2009l~). 

_ Spurzeml dl999l~): ISnurzem et aid |20^ and 

iHemsendorf. Khalisi. Omarov fc Spurzeinl (1200311 dis¬ 
cussed several different types of hardware for parallelization 
and extended NBODY6 to NBODY6-b-b for g eneral parallel 
supercomputers. Later, iNitadori fc AarsethI d2012l l devel¬ 
oped a GPU-based parallel force calculation for NBODY6. 
As a result, large Wbody simulations {N 10®) became 
possible on a single desktop computer or workstation with 
GPU hardware. They also implemented the parallel force 
calculation based on Streaming SIMD Extensions (SSE) and 
Advance d Vector Exten si ons ( A VX) for recent GPU archi¬ 
tectur es. [Spurzem_^£^^(|2011 ); iBerezik. Spurzem fc Wan3 
ll2013ti and Berezik et al.l (2013) discussed the performance 
of large A^-body simulations with the GPU-accelerated 
codes (/GPU and a provisional version of NBODY6++GPU. 

With these parallelization methods, we can now study 


star clusters w i th a number of stars exceeding 10®. 
iHurlev k. Sharal (l2012lf simulated 200,000 stars includ¬ 
ing 5000 primordial binaries with initial half-mass radius 
4.7 pc using NBODY4 on a GRAPE-6 based computer 
to investigate c o re co llapse and core oscillation. Later, 
ISippel Hurle'v^ (l2013l ) studied the multiple stellar-mass 
black holes in globular clusters by simulating 262, 500 stars 
including 12, 500 primordial binaries with initial half-mass 
radius 6.2 pc using NBODY6-GPU. The current largest 
direct A-body simulation modeling the globular cluster 
M4 used one computing node including 12 Intel Xeon 
X5650 cores (2.66 GHz per core) and 2 NVIDIA TESLA 
C2050 GPUs with 448 cores each (1.15 GHz per core) with 
NBODY6-GPU dHeggid [2014l L This simulation contained 
500, 000 stars with 7% binaries and a small half mass radius 
of 0.58 pc. Nowadays, we can make an effort to reach one 
million stars by using parallel supercomputers with GPUs. 

In this paper, we first introduce the parallel algorithm 
used by NBODY6-GPU and NBODY6-f-f in Section H 
Then we describe the new version of NBODY6H—hGPU with 
a hybrid parallel method and also the new algorithms that 
are necessary for large number of particle parallelization in 
Section O Performance tests are carried out in Section [I] In 
Section [3 we show an application to a globular cluster with 
one million stars. In Section[6l we discuss the parallelization 
limit and future development of NBODY6++GPU. Finally, 
we present our conclusions in Section [7] 


2 THE FEATURES OF NBODY6/6-|— 

NBODY6 uses the fourth-order Hermite integration method. 
iMakind (1 199 il l presented a careful analysis of the per¬ 
formance and energy error of the Hermite integrator. He 
showed that it reduces to the similar asymptotic error 
behaviour a s the standard Aarseth scheme (fourth-order 
method; see I AarsethI 1 1985l l but it has some advantages in 
the time step choice and data structure. 

The hierarchical block time st eps method is used to¬ 
gethe r with the Hermite integrator jMcMillanll 19861 : iMakind 
liggiTl in NBODY6, which avoids the overheads of particle 
position and velocity prediction in an individual time step 
method. In this method, particle time steps are adjusted to 
quantized values, usually an integer power of 0.5. Then at 
each time step, active particles (the particles that satisfy the 
time step criterion) are integrated together. 

To speed up the force calculation, NBODY6 uses the 
Ahm ad-Cohen (AG) neighbor scheme (lAhmad GohenI 
[Tizi). The basic idea is to employ a neighbor fist for each 
particle. The integration is separated into two parts: reg¬ 
ular force integration for large time steps (regular steps) 
and irregular force integration for small time steps (irregu¬ 
lar steps). The regular force is the summation of the forces 
from particles outside the neighbor radius and the irregu¬ 
lar force accumulates only the neighbor forces. During the 
irregular step, the regular force and its first order deriva¬ 
tive calculated at the last regular step are used for position 
and velocity prediction. The AC scheme gains efficiency with 
sequential computing (without parallelization). The speed 

J ained by the AG scheme is roughly proporti onal to A^^^ 
Makino fc Hutl[l988l : IMakino fc Aarsethiri992l L However, in 
parallel computing, this gain is limited by the complexity of 
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the implementation of this algorithm (see Section lU. Also, 
the benefit of reducing overheads of particle prediction in the 
block time step method is strongly limited in the neighbor 
sche me. _ _ 

IPortegies Zwart fc Boekhold (l2014h discussed the inte¬ 
gration accuracy requirement for self-gravitating systems 
simulated with direct A^-body codes. They found that for 
three-body systems the integration should have total en¬ 
ergy conserved better than 1/lOth. Although this accuracy 
requirement is uncertain when the simulation is extended to 
large particle number systems, this work indicates the im¬ 
portance of careful integration treatment for direct A^-body 
systems. One im portant feature of NBODY6 is t hat it uses 
the algorithms of iKustaanheimo fc Stie:^ (Il965[l (herea fter 
KS) and chain regularization ( Mikkola fc Aarseth Il993li to 
deal with an accurate solution of close encounters, bina¬ 
ries and multiple systems, which play a significant role in 
star cluster dynamical evolution. These strong interactions 
require very small time steps during integration and may 
produce large errors with standard integrators such as the 
Hermite scheme. Using KS and chain regularization is also 
the most important feature of NBODY6 for star cluster sim¬ 
ulations. 

Here, we have briefly introduced the main algorithm 
used in NBODY6/6-I—1-. In the next section we will focus on 
the parallelization of the codes. 


3 PARALLELIZATION OF NBODY6-|-+GPU 

3.1 MPI parallelization of NBODY6-|--|- 

urze ini (Il999ll and iHemsendorf. Khalisi. Omarov fc Spurzeml 
(l2003tl developed NBODY6++ based on NBODY6 using 
MPI parallelization with the copy algorithm. Both regu¬ 
lar and irregular forces were parallelized. Here different 
MPI processors calculate different subsets of the active 
particles. Each MPI processor has the complete particle 
dataset. Another available parallel algorithm is the ring 
algorithm which splits the full particle dataset for different 
MPI processors. It reduces the memory cost in each MPI 
process. The benefit of the copy algorithm compared to 
the ring algorithm is that there is no requirement for 
extra communication of the neighbor particle data which 
is not in the same MPI process during the irregular force 
calculation. The disadvantage is the particle number limit 
due to memory size on the computing node. The MPI 
communication with the copy algorithm has constant time 
cost (independent of MPI processor number except for 
latency). The scaling of the regular force with different MPI 
processors is very good. Since the regular force dominates 
the calculation, t his results in a good scaling of the tota l 
computing time. iDorband. Hemsendorf fc Merritt (l2003l f 
provide d a detailed di s cussio n of t hese communic ation algo¬ 
rithms. iLippert et al.l (Il998lf and iMakind (l2002lf suggested 
an efficient communication algorithm (hypersystolic) for 
extremely large processor numbers. 


3.2 Basic NBODY6-GPU implementation 

After the GPU computing (CUBA) became popular, the 
shared memory parallel NBODY6-GPU code was developed 


for w orkstation and desktop computers (iNitadori fc AarsethI 
l2012h . The OpenMP, GPU (GUDA) and AVX/SSE paral¬ 
lel methods are used to make the code as fast as possible. 
However, NBODY6-GPU can only be used in a single node 
(no massively parallel MPI implementation) so the number 
of particles is limited for a reasonable simulation time. 


3.2.1 Regular force and potential (GPU) 

The GPU library of INitadori fc AarsethI (l2012h is used for 
calculating the regular force, which dominates the direct in¬ 
tegration, and potential energy calculation. The cost for reg¬ 
ular force calculation per particle scales with 0{N) and for 
potential energy calculation scales with 0{N^). The perfor¬ 
mance of GPU force calculation is very good since the pure 
force calculation is easy to parallelize. GPUs also help to ac¬ 
cumulate the neighbor list very efficiently during the regular 
force calculation. 


3.2.2 Prediction and irregular force (AVX/SSE) 

When GPU accelerates the regular force very efficiently, 
the irregular force becomes expensive. However, this part 
is hard to parallelize on GPUs due to the complexity o f 
the AC neighbor scheme. Thus, INitadori fc AarsethI (I2OI2II 
developed the AVX/SSE and OpenMP parallel library for 
neighbor particle prediction and irregular force calculation. 
AVX/SSE is an instruction set for CPUs developed in re¬ 
cent years, which supports vector calculation in the specific 
cache. The advantage of AVX/SSE with OpenMP is that 
there is no extra memory copy compared to GPU. Por both 
AVX/SSE and GPU libraries, the data needs to be copied 
once for changing data structure to obtain computing effi¬ 
ciency. This is because that NBODY6 has a very long devel¬ 
opment history, thus to completely change the data struc¬ 
ture to be consistent with AVX/SSE and GPU libraries is 
very time consuming. But for GPU, there is extra data copy 
from the host memory on the mother board to the device 
memory on GPU. Besides, since the neighbor force calcu¬ 
lation is not efficient for the distributed memory parallel 
method (with MPI parallelization; see discussion below), 
this kind of shared memory parallel method is more effi¬ 
cient. 


3.3 Code improvements in NBODY6H—|-GPU 

In this subsection we describe our new implementations in 
NBODY6-f-fGPU. 

The GPU acceleration, especially of the long-range (reg¬ 
ular) gravitational forces, is very efficient so this part does 
not dominate the computational time any more, as we show 
below. Secondly, the AVX/SSE implementation accelerates 
prediction and neighbor (irregular) forces, which is the next 
most time consuming part of the code. 

We have combined the GPU and AVX/SSE acceler¬ 
ation, which was done for a single node in NBODY6- 
GPU, with the MPI parallelized NBODY6-I—I- designed 
for multi-node computing clusters for the new version 
NBODY6-I—l-GPU. This work requires additional efforts to 
keep the code consistent (see below). In addition, we have 
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worked on remaining bottlenecks, such as time step schedul¬ 
ing and stellar evolution, which become important for mil¬ 
lion bodies because the usual computationally intensive 
tasks have been accelerated very effectively by GPU and 
AVX/SSE. 

3.3.1 New algorithm of selecting active particles for block 
time steps 

For the block time step method, active particles should be 
selected at every time step. It is very expensive to search 
all particles for the active ones, especially for the irregu¬ 
lar force calculation. In this case, for one block time step 
the cost of selecting active particles scales with 0{N) while 
the irregular force calculation cost scale with 0{Ni{Nh)). If 
N ';$> Ni{Nh), the former can be more expensive. When the 
simulation reaches millions of particles, the block time step 
levels can be quite deep (the smallest time step can reach 
0.5^° — 0.5^^) and the deep blocks with few particles and 
small time steps can easily satisfy this condition. One may 
consider to use a temporary list to save particles with small 
time steps and only search all particles at some selected 
time interval from the temporary list each time step. How¬ 
ever, this method is still expensive where there are many 
particles with small time steps (such as the wide binaries 
that are not KS regularized). Indeed, we find that the time 
of selecting active particles can be much larger than the 
irregular integration time, even with this temporary list al¬ 
gorithm for one million particles including 5% primordial 
binaries. Another reason that forces us to deal with this is¬ 
sue is that the active particles selection is very difficult to 
parallelize efficiently (the cost is almost independent of pro¬ 
cessor numbers) and would be prohibitive for a million-body 
simulation. Thus, we propose a better algorithm that uses 
a time step sorti ng list (hereaf ter sorting list algorithm; see 
Figures [T] and El) . I Zhone (l2014l i implemented a similar algo¬ 
rithm for (^GRAPF-|-GPU and evaluated its performance. 
The basic idea is that when we have the index list sorted 
by particle time step from smallest to largest, and the indi¬ 
cators of each boundary offset 7off(i) between the block of 
the same step particles (the largest particle index with step 
0.5”), we only need to find the correct offset at each block 
time step by using the algorithm shown in Figure [T] to select 
active particles (shown as black squares in Figure [2]). After 
integration, we adjust the sorted list by sorting the active 
particles’ new time steps. The specific sorting method for 
this adjustment can be optimized to 0{Ni) if we ignore the 
stability of sorting (stability means no exchange of the or¬ 
der for the particles with same steps) and assume that many 
active particles keep the same step as before or have small 
time step changes. 


3.3.2 The initialization 

The initialization of a simulation in NBODY6 is relatively 
expensive. We improve it with MPI, GPU and OpenMP 
parallelization and a better algorithm. The initial model for 
million-body simulations is very important and needs to be 
carefully tested. This improvement is very useful for fast 
testing of the initial models with large particle numbers, 
especially for a large number of primordial binaries. 



Figure 1. The flow chart for obtaining the correct offset in the 
sorting list algorithm. Tnew is the time after next integration. 
Tnow is current time. is the current smallest time step (the 

time step of the first particle in sorting list), iprev is the previous 
offset. 
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Figure 2. Diagrammatic sketch of sorting list algorithm for se¬ 
lecting active particles. The time step of each particle block is 
0.5* separated by boundary indicators /off(*) (vertical lines). The 
integration advances vertically in the chart. Active particles are 
shown as black squares. 


The initialization of NBODY6 can be divided into four 
parts: 

(i) reading or generating masses, positions, velocities and 
stellar evolution parameters of all stars; 

(ii) scaling all par ameters into A^-body unit s (the Af-body 
unit^ are defined in iHeggie fc Mathi^1l986l l ; 


^ It has been suggested to name the A^-body time unit to honour 
M. Henon as Henon time unit (D.C. Heggie, private communica¬ 
tion) 



































NB0DY6+ + GPU: Ready for the million-body problem 5 


(iii) initialization of forces, neighbor lists and time steps 
of all stars; 

(iv) initialization of primordial KS binaries. 

In the second part of the intialization, the total potential 
energy of the system is needed and costs 0{N^). Actually, 
NBODY6-GPU does this calculation twice for scaling pur¬ 
pose in the case of an external tidal field. The GPU is used in 
NBODY6-GPU to speed up this part and it is very efficient. 
Our new improvements are for the third and fourth parts. 
In the traditional NBODY6-GPU version forces and neigh¬ 
bor lists are initialized separately without parallelization. 
NBODY6-I—I- parallelizes the scaling and initialization of the 
force parts, but only through MPI. For million-body simu¬ 
lations this is very slow and requires hours to be finished. 
We improved it by using GPU based force and neighbor list 
calculations (the same as for the regular force calculation). 
The fourth part is very costly with more than 5% primor¬ 
dial KS binaries in the traditional NBODY6-GPU (several 
hours). During initialization of KS binaries, the force and its 
three derivatives (Hermite scheme) need to be renewed for 
center-of-mass particles. All neighbor lists that contain KS 
binary component indices also need to be replaced by the 
center-of-mass particle indices. The cost is approximately 
0{N{Ni,)NKs) where Aks is the number of primordial KS 
binaries. We find a much simpler way to initialize KS bina¬ 
ries (cost scales with O(Aks)) by just switching the order of 
the third and fourth parts: initialize KS binaries first with¬ 
out recalculating forces, their derivatives and neighbor lists 
(only the KS transformation is needed) and then do the third 
process with the new center-of-mass particles data generated 
by former process instead of each binary component in the 
old way. In this case there is no need to update the forces 
and neighbor lists. 

3.3.3 Position and velocity prediction 

During the force calculation, the predicted positions and ve¬ 
locities are used to calculate the force and its first derivative 
for the Hermite integrator. In principle, we can avoid predic¬ 
tion of the same particles with the AG neighbor scheme and 
block time steps. However, in practice we need to search all 
neighbors of each active particle and the search itself is com¬ 
putationally expensive. Thus, it does not save much time to 
avoid neighbor prediction overlap and it is much simpler to 
predict all neighbors and do the force calculation within one 
loop. The disadvantage of this method is that it costs more 
when the average neighbor number (Ab) multiplied by the 
active particle number Ai is larger than the total particle 
number A, compared to all the particle predictions with a 
non-AC scheme block time step. One solution is to try pre¬ 
dicting all particles once instead of predicting each neighbor 
when (Ab) Ai > A. But the mixture of predicting only neigh¬ 
bors and predicting all particles increases the complexity of 
code. We therefore use only neighbor prediction in the code. 

However, there is a major complication for the parallel 
neighbor prediction in NBODY6-I—l-GPU, which does not 
exist in NBODY6-GPU. Since we use AVX/SSE and GPU 
and the code is mixed with CUBA, C++ and Fortran 77 pro¬ 
gramming language, the AVX/SSE and GPU libraries keep 
the individual copies of particle datasets. Thus, the predic¬ 
tions of particles have overlaps and are usually inconsistent 


for different copies distributed on MPI processors. Due to 
the complexity of NBODY6/6+-I- (e.g., using predicted po¬ 
sitions for regularization) this leads to problems of synchro¬ 
nization later on, such as differences of time steps for the 
same particle on different processors. The safest but very 
costly way is to always predict all particles at every irregu¬ 
lar integration step, which is the case in the older versions of 
NBODY6-I—h. To solve this problem, much effort has been 
made to ensure that every particle is predicted to the current 
time before it is used in stellar evolution, KS and hierarchi¬ 
cal regularization, because these parts are not parallelized 
and should have the same computing results on every MPI 
processor. 

3 . 3.4 Stellar evolution and neighbor force correction 

The neighbor scheme also leads to performance losses for 
the calculation of stellar evolution. When a star experiences 
mass loss, other stars feel a smaller force. In the neighbor 
scheme, the regular force is predicted from the value calcu¬ 
lated at the last regular time step, thus if particles outside 
the neighbor radius experience mass loss between the pre¬ 
vious and next regular time steps, the regular force will be 
inconsistent after that. The correction for the regular force 
should be done for all particles which have the mass loss par¬ 
ticle outside their neighbor radius. To avoid a large value of 
the third and fourth derivatives of the force, the irregular 
force also needs to be updated if the mass loss particle is in¬ 
side the neighbor radius. When mass loss is frequent, the cal¬ 
culation performance will be reduced significantly. We cur¬ 
rently use OpenMP to speed up the force correction, but it 
cannot completely solve this issue since the force correction 
with cost of 0(A) per particle cannot be avoided. 

3.4 Hybrid MPI parallelization 

Based on the above parallel methods, we develop a new ver¬ 
sion of NBODY6H—hGPU to include hybrid parallel proce¬ 
dures. The parallel structure of NBODY6-I—l-GPU is shown 
in FigureO In computer clusters, each computing node uses 
one MPI process. Each MPI process opens multiple threads 
via OpenMP for the irregular force calculation. GPUs inside 
one node are controlled by OpenMP threads. Each GPU has 
a similar particle dataset size for regular force and poten¬ 
tial energy calculation. GPUs of different nodes are isolated 
without communication. Thus all GPUs in the same node 
together access the complete particle dataset. The best code 
configuration is to use multiple GPU cores (such as 8 — 16 
cores) and several GPUs (such as 1 — 4 GPUs with a few 
thousand cores) per node, and choose node numbers based 
on the total number of particles. 

4 PERFORMANCE TEST 
4.1 Pure MPI and hybrid MPI 

Figure [4] shows significant improvement of 
NBODY6-h-hGPU by using hybrid MPI including GPU, as 
compared to the pure MPI case. We see that the GPU gives 
about 33 times faster regular force integration. This is to 
be expected since GPU is designed for large parallelization 
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Table 1. The definitions of abbreviations for all figures 


Abbreviation Definition 


Reg. 

Irr. 

Pred. 

Init.B 

Move 

Comm.I. 

Comm.R. 

Send.L 

Send.R. 

Adjust 

KS 

Barr. 


Regular integration (force) and neighbor list determination 

Irregular integration (force, prediction (AVX/SSE version) and correction) 

Neighbor (Non-AVX/SSE version) and all particles (for regular force) prediction 

Initialization of active particle list for block time step 

Particle data copy prepared for MPI communication 

MPI communication for irregular integration 

MPI communication for receiving integration 

Particle data copy for AVX/SSE irregular force calculation 

Particle data copy for GPU regular force calculation 

Energy checking, adjustment of parameters and data results 

KS regularization calculation (binary and hierarchical systems) 

MPI communication barrier waiting time due to the imbalance and network traffic between different nodes 



Figure 3. NBODY6++GPU code structure. It shows one cycle of simulation. Based on the time steps, the integration can be divided 
into three hierarchical parts (see TableKS calculation (KS), irregular integration (Irr.) and regular integration (Reg.). The KS has 
smallest time step distribution. Thus, between two nearest Irr. block time steps there are several KS steps. Similarly, between two Reg. 
block time steps there are several Irr. time steps. After several Reg. time steps there is one “Adjust” (see Table^. Inside one node, Reg. 
and Adjust are parallelized by multiple GPUs and Irr. is parallelized by AVX/SSE with OpenMP. MPI parallelization are done for all 4 
parts between different nodes. 


by using many computing cores and large memory band¬ 
width within one card. Using AVX/SSE with OpenMP 
gives about 3 times faster irregular integration including 
predictions. OpenMP reduces the MPI communication cost 
by a factor of 5 — 10. The individual MPI communication 
process is not directly sped up by OpenMP. When we use 
MPI parallelization together with OpenMP method, inside 
one node the irregular and regular force calculations are 
done by multiple threads with OpenMP instead of MPI 
parallelization, thus we can set a larger block particle 
number threshold for MPI parallelization by a factor of 
the OpenMP thread number and reduce the total MPI 


communication frequency. This then results in shorter total 
MPI communication time. 

4.2 Scaling with different particle numbers and 
processors 

The scaling with different particle numbers and pro¬ 
cessors demonstrate the possibility of using large com¬ 
puting resources for simulations. We test hybrid paral¬ 
lel NBODY6-I-+GPU scaling with different node numbers 
ATode (1, 2, 4, 8 and 16; up to 320 CPU cores and about 
80k GPU cores) and different particle numbers (16k, 32k, 
64k, 128k, 256k and 1024k) on the “Hydra” cluster of the 
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10000 


1000 


100 


GPU AVX+OpenMP GPU AVX+OpenMP 

3825.7 

~ 2929.2 


OpenMP 


Hybrid MR 
Pure MR 


220.2 



Init. Total Irr. Reg. Red. Init.B Move Comm.l. Comm.R Send.l. Send.R Adjust Barr. 


Figure 4. Comparison of performance between pure MPI and hybrid MPI (GPU + AVX/SSE + OpenMP + MPI) on the “Kepler” 
cluster at ARI, Heidelberg University. The test uses 256k particles with a Plummer model, IMF from Kroupa (2001) with mass range 
0.08 — IOOMq. The hybrid MPI test uses 4 nodes and each node includes 32 Intel Xeon E5-2650 cores (2.00 GHz per core) and 4 NVIDIA 
K20m with 2496 cores each (706 MHz per core). The pure MPI test uses the same configuration of nodes and CPU cores. The label 
“Total” means total time cost for 1 V-body unit and “Init.” denotes the initialization time of the simulations. 


Max-Planck Supercomputing Centre (RZG) Germany. Each 
node is completely controlled (no other tasks on the node) 
and has two NVIDIA K20X with 2688 cores each (732 
MHz per core) and 20 Intel Ivy Bridge cores (2.8 GHz per 
core). The total computing time for one V-body time unit 
7tot/7NB is shown in Figure [5l The irregular and regular 
force integration computing time (Tirr and Tleg) are shown 
in Figure [6l All these three times are the averaged com¬ 
puting times of the first two A-body time units of each 
simulation. We test two basic initial models. One has no 
primordial binaries and another ha s 5% binaries. Both use 
a Plummer sphere Jpi umme 3ll911^ and initial mass func¬ 
tion (IMF) from lKroupa, Tout fc Gilmord (Il993l ) with mass 
range 0.08 — 20Mq and no stellar evolution. 

In the non-binary case, the scaling with different Anode 
for the total time is not ideal because of the communica¬ 
tion cost. Here the speed-up saturates at about 8—16 nodes 
depending on the particle number. But if we consider the 
number of cores per node (20 GPU cores and 5376 GPU 
cores), the scaling with cores is excellent, since with 16 nodes 
320 CPU cores and 86016 GPU cores are used. With one 
node, the performance of NBODY6-I—bGPU is si milar to 
that of NBODY6-GPU. iNitadori fc AarsethI (l2012l ) showed 
that NBODY6-GPU gives about 100 times speed-up com¬ 
pared to the sequential NBODY6 with two NVIDIA GeForce 
GTX 560 Ti with 384 cores each (822 MHz per core) and 4 
Intel i7-2600K cores (3.40 GHz per core). On the “Hydra” 
cluster node, the speed-up can reach 100 — 500 depending 
on the particle number. Thus, with 16 nodes for one million 
particles, we can reach a factor of 400 — 2000 speed-up com¬ 
pared to the sequential NBODY6. Besides, the absolute time 
cost is very good, especially for the million-body case, the 


total time is about 800 s for Anode = 16. For the cjiGPU code 
tested in the “Laohu” cluster with 32 NVIDIA K20 GPU, 
one million particles take about 1500 s. Although we can¬ 
not compare the two codes directly with different computing 
cluster specifications, with the similar GPU type and num¬ 
ber NBODY6-I--I-GPU can reach better performance. GPU 
cores are ignored here because the time fractions on the GPU 
for these two codes are very different: c^GPU spends about 
90% computing time on the GPU while NBODY6-I--I-GPU 
has much less time fraction (Section 14.311 . In the case with 
5% primordial binaries, the scaling is not as good as for the 
case with no binaries due to the KS calculation. We discuss 
this issue in more detail in Section |6] 

The regular and irregular integration times in Figure 
are close to ideal for million-body simulations. It means that 
when ignoring MPI communications, the MPI paralleliza¬ 
tion speeds up regular and irregular calculation excellently 
for large number of particles. For small particle numbers 
(< 10®) the scalings of both regular and irregular integra¬ 
tions depart from the ideal parallel limit. The reason is that 
the number of operations on each node (for regular integra¬ 
tion is on GPU) is small. Thus, the cost of internal memory 
accessing and modification during integration, which can¬ 
not be scaled with computing cores or nodes, dominates the 
time. 


4.3 Time fraction for different parts 

We show the fraction of time spent on different parts of 
NBODY6-I—l-GPU in Figure 0 In the model without bina¬ 
ries, MPI communication and data moving consumes about 
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Figure 5. Performance of NBODY6++GPU with hybrid MPI 
on the “Hydra” cluster as the function of node number Ynode- 
Ttot/PNB shows the computing time cost per Y^-body time unit. 
The configurations of each node are indicated in the panels. The 
dashed line shows the ideal parallel limit with zero communication 
cost. Different colors represent different particle numbers. 


half of the total time in the case of 1024fc particles with 
A^node = 16 and 128fc particles with Afnode = 8, which means 
the scaling reaches the MPI parallelization speed-up break¬ 
even point. For the 1024fc particles with 5% binaries, the KS 
takes about half of the calculation time when A^node > 8. 
Thus, the KS procedures become the performance bottle¬ 
neck. 


4.4 Sorting list algorithm for selecting active 
particles 

In Figure [S] we compare the performance of the sorting list 
algorithm and temporary list algorithm described in Sec- 
tion l3.3.T1 The st ar cluster in our test simulation is modelled 
as a King sphere (iKin with Wo — 6 using 1024fc stars. 



Nnode 



Nnode 


Figure 6. Performance of regular and irregular integration on 
the “Hydra” cluster as the function of Ynode- Here the same node 
configurations and line types as in Figure[5]are used. Tirr/T^B and 
Treg/TkB shows the irregular and regular integration computing 
time cost per Af-body time unit respectively. 


5% of primordial binaries and 8 nodes with the same node 
configuration as in Figure [5] To indicate that the sorting is 
very fast, the time of pure sorting part in this algorithm is 
also shown. We can see the sorting list algorithm is about 5 
times faster than temporary list algorithm. 

The time fraction of active particle selection with the 
new algorithm is shown as yellow part (Init.B.) in Figured 
We can see Init.B. costs more for simulations with a larger 
number of particles. Even with this new method, it is close 
to irregular integration cost for the one million particles case 
7%). 


5 APPLICATION 

The main task for NBODY6-I—l-GPU is to simulate large 
star clusters. For a typical globular cluster, the total mass 
is 10® — 10 ® Mq, thus the total number of particles is of the 
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128 k without primordial binaries 


Figure 7. Pie charts showing the same test as Figure [3] but 
time fraction of different components in Hybrid MPI parallel 
NBODY6H—h. In each chart different rings show different A^node- 
From inside to outside rings, Ynode 1? 2, 4, 8 and 16. The two 
pie charts on the left show the model without primordial binaries 
and the pie chart on the right shows the model with 5% binaries. 
The models in top two pie charts include 1024fc particles (singles 
+ binaries) and the model in the pie chart at the bottom include 
128A; single particles. An explanation of the legends is provided 
in Table FTI 
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Figure 8. Comparison of performance between the sorting list 
algorithm and the temporary list algorithm. The “Pure Sorting” 
means the time cost of sorting part in sorting list algorithm. 
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order 10®. The typical age is about 12 Gyr. In our IM stars 
with 5% primordial binary globular cluster model (the same 
as shown in F igure ISll. we choose the parameters similar 
to NGC 4372 jHarriaTlQQSll . The initial half-mass radius is 
7.5 pc and the tidal radius is 89.2 pc with a circular orbit 
around a point-mass galactic potential. One V-body time 
unit corresponds to 0.622 Myr. The computing time and 
number of particles are shown in Figure [O] 

Initially, the computing time per V-body time unit was 
about 3000 s and this increased when several small time step 
particles formed. Later, we carried out several adjustments, 
then the simulation sped up and became about 1500 s. The 
number of particles only decreased slightly during 4500 N- 
body time units, but the computing speed actually increased 
at a later stage. The reason for the early slow speed was the 
two unsuitable criteria for triggering or terminating the two- 
body KS regularization. The first is the separation criterion 
Rci and the second is the time step criterion Atd- If the auto- 
adjustme nt of Rd and Afci are used, they are determined 
following lAarse t3 (l2003ll 


Rci = 


4i?h 

iV(pd/ph)i/®’ 


Atci 0.04 



( 1 ) 


where pdjph is the central density contrast, rji is the stan¬ 
dard irregular time step coefficient and (m) is average mass. 
The factor 4Rh/N is the impact parameter for a 90 degree 
deflection in a two-body encounter. The auto-adjustment re¬ 
sults in Rci = 1.4x 10“® and Atd = 6.8 x 10“® V-body units 
at the beginning of this simulation. But these values are too 
small and many wide binaries including some unperturbed 
binaries are not regularized. Thus we switched off auto¬ 
adjustment and used Rd = 5.0 x 10“® and Atd < 2.0 x 10“^^ 
before about 2800 time units. We found that the Rd and 
Atd parameters were still too small, thus we enlarged Rd to 
1.0 X 10“® and Atd to 5.0 x 10“^. Then the computing sped 
up after 2800 time units. The small parameters from auto¬ 
adjustment is because Eq. [T] is originally designed for small 
number of particles [N = 10^ — 10®). For the million-body 
simulation, the central density is usually high and pdiph is 
large. The criterion from Eq. [T] is only suitable for the cen¬ 
tral region of the cluster but too small for the outer region. 
There were several jumps in the computing time after the 
auto-restart with reduced time steps. These happened when 
a large energy error appeared due to specific events, such as 
difficult triple systems or the sudden change of force caused 
by large mass loss or premature perturbation of the neigh¬ 
bor sphere (such as neutron stars with high kick velocities). 
After we restore the normal time step parameters the com¬ 
puting time was again reduced. 

There are also a few more models currently in progress 
and we will report in detail about the results from these 
simulations in a future publication. 


6 DISCUSSION 

While standard Hermite codes report a high efficiency us¬ 
ing up to 700,000 cores on hundreds if n ot thousands 
of GPUs llBerczik, Sourzem fc Wand l2013l : iBerczik et al.l 
l2013ll . we find that our performance saturates at about 
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Figure 9. The evolution of the computing time per A'^-body time 
unit and the number of particles for the IM globular cluster sim¬ 
ulation as function of A^-body time. 


86,000 GPU cores and 320 CPU cores. This is not surprising, 
because NBODY6 and NBODY6-I--I- are inherently more ef¬ 
ficient than standard Hermite codes (less operations for the 
same physical result). A more detailed scaling analysis of 
NBODY6-I-+GPU will be published separately (Huang et 
al., private communication). 

As discussed in sections and 1131 the data move¬ 
ment and MPI communications become the bottleneck when 
the node number A^node is large since they have constant 
cost and the KS integration dominates the calculation when 
there are many primordial binaries. For the data copying 
and communication limit, a better communication algo- 
rit hm (such as non-blocking comm u nicati on as suggested 
by iDorband. Hemsendorf fc Merritt! ll2003ll , which we will 
probably work on in the future), a higher network bandwidth 
between nodes and faster memory access are required. 

In the common computer architecture today, the pure 
calculation operations for GPU is about two orders of mag¬ 
nitude faster than to access data from the host memory. 
For the non-shared memory parallelization like MPI, if the 
data communication consumption is larger than the calcu¬ 
lation, the parallelization cannot improve the performance 
and sometimes even reduces the speed. Table[2]compares the 
calculation and communication costs for the regular force, 
the irregular force and the KS perturbation calculations. 
The ratio of calculation cost to communication cost, Rc, for 
the regular force is proportional to the full particle number 
N. Thus, the GPU and MPI parallelization for the regular 
force gives a very good scaling. For the irregular force, Rc is 
proportional to the average neighbor number. When there 
are many neighbors, the MPI parallelization is good. For 
typical star cluster simulations, the neighbor number Nh is 
a few hundred, thus it is acceptable. In NBODY6-I-+GPU, 
the data movement and MPI communication is significant 
for the irregular force (FigureO- For KS perturbation calcu¬ 
lation, Rc is proportional to the average perturber number 
Np, which is usually quite small (less than 100). Thus, MPI 
parallelization for KS can be inefficient. The reason for the 


Table 2. Estimation of calculation and communication cost 


Cost 

Regular force 

Irregular force 

KS perturbation 

Calculation 

0{NiN) 

0(Afi(Afb» 

0(N,(Np}) 

Communication 

O{N0 

om 

O(iVi) 


*Ni: Active particle number 
*N-. Full particle number 
*(Yb)' Average neighbor number 
*{Np): Average perturber number for KS 


small Np is that usually in star cluster simulations, a large 
fraction of the KS binaries is unperturbed with Np = 0, and 
perturbed KS binaries also tend to have small Np (otherwise 
they would be terminated or transformed to hierarchical sys¬ 
tems). Therefore, to get good performance of KS paralleliza¬ 
tion, the unperturbed and perturbed KS parts should be 
treated separately, since unperturbed KS only needs few op¬ 
erations and should avoid communication when parallelized 
(shared memory parallelization such as OpenMP or MPI-3). 
We are working on this and will show our KS parallelization 
method and benchmarks in a future publication. There is 
also another effort to parallelize KS with block time steps 
(Nitadori 2014, private communication). 

We also find that the KS initialization and termination 
can be costly when there are wide binaries that frequently 
switch between KS and Hermite solutions. As discussed in 
Section f3. 3.2 1 during the KS initialization and termination, 
the force and its first three derivatives need to be renewed for 
center-of-mass particles or two components (cost of 0{N)) 
and the neighbor list of every particle and perturber list of 
KS pairs should be updated with new particle index (cost 
of 0{N(Nh)))- The regular part can be improved by using 
existing values instead of a direct calculation. The latter 
can be improved by using a reverse neighbor list for fast 
searching which particle has the KS pair as its neighbors. 
However, this requires large memory cost and coding effort. 

When testing the code performance in computer clus¬ 
ters, we usually use the empty nodes where no other tasks 
are performed simultaneously. But for the applications, 
whether we can use scheduling whole nodes depends on 
the task management system in the clusters. Some clus¬ 
ters, such as “Laohu” at NAOC and “Milkyway” at the 
Jfilich Computing center, only allow very few GPU cores for 
GPU tasks (1-2 GPU cores per GPU) and all other CPU 
cores in the same nodes are reserved for pure CPU tasks. 
NBODY6-I--I-GPU is not suitable for these kinds of clus¬ 
ters since it relies on heavy calculation on CPU (irregular 
and KS integration, data movements; see Figure [7]). More¬ 
over, in the shared nodes, different tasks compete with each 
other for network bandwidth, CPU loading and host mem¬ 
ory. This sometimes results in a serious load imbalance; The 
MPI barrier time (Table [TJ covers almost half of the total 
computing time. The only solution is to use computing clus¬ 
ters in which GPU nodes can be fully occupied by one GPU 
task each time. 

Both NBODY6 and NBODY6-I-+ have been developed 
over a long time. The codes have become more and more 
complicated which makes it difficult for beginners. There¬ 
fore, we also present documentation for the new version 
of NBODY6-I-+GPU. The document includes a detailed 
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description of all input parameters and output data and 
will be updated with more details and new implementa¬ 
tions. We also show several important differences between 
NBODY6-b-bGPU and NBODY6-GPU in the Appendix. 

The future improvements of the codes and hardware 
may lead to simulating even larger particle numbers, e.g., 
for nuclear star clusters using more GPU nodes appears 
feasible. The key to keep total wall clock times reasonable 
will be further optimization of communication and data 
management, especially for particles with very small time 
steps near a central black hole. Also, bandwidth and la¬ 
tency of communication hardware may help to gain one 
more order of magnitude, but not to reach the Exaflop/s 
regime. For the latter, hybrid codes seem more appropriate, 
which treat a large number of particles in the outskirts self- 
consistently, but not with full N'^ accura cy of the force com¬ 
putation (see, for recent examples, e.g ., iMeiron et al.l 12014 : 
iKarL Aarseth, Naab fc Haehneltll2013 b 


7 CONCLUSIONS 

Direct numerical simulations of star clusters contribute sig¬ 
nificantly to the theoretical understanding of star cluster 
dynamics. Due to hardware and software limits, direct N- 
body simulations of real globular clusters with large number 
of particles hav e been a major challenge for many years. 
ISugimoto et al.l (Il990l ) pointed out that direct numerical 
simulations of globular star clusters could not be completed 
for the next decades unless there are breakthroughs in par¬ 
allel computing which violate Moore’s law. After that, many 
efforts were made to reach this goal by using specially de¬ 
signed acceleration hardware (GRAPE and GPU). 

In this paper, we present NBODY6H—hGPU. It com¬ 
bines for the first time the massively parallel multi¬ 
node code (MPI parallelized) NBODY6++ Jsp urzem|[l9^ : 
iHemsendorf. Khalisi. Omarov fc Spurzeinl I 2 OO 3 I I with the 
GPU and AVX/SSE accelerati o n on each node, using the 
libraries of iNitadori fc AarsethI ll2012tl . We discuss the per¬ 
formance tests (Figure (4] O [6] and [T]) and new algorithms 
(Figure [T] [ 2 ] and [8]) to accelerate the NBODY6-I—l-GPU. For 
the non-binary case, the overall scaling is good up to 16 
nodes (320 CPU cores and 32 NIVDIA K20x GPUs includ¬ 
ing 86016 GPU cores) with a speed up of 400 up to 2000 
depending on the particle numbers. The speed up is mainly 
achieved by the usage of GPUs to accelerate the long-range 
(regular) gravitational forces, which gives about 33 times 
faster force calculation (Figure U). The AVX/SSE increase 
the speed of prediction of positions and velocities and neigh¬ 
bor (irregular) forces by a factor of 3. We also worked on 
the consistency of the code when combining several parallel 
methods together to ensure the stability. When GPU and 
AVX/SSE accelerate the force calculation very efficiently, 
other parts become bottlenecks of performance, such as time 
step scheduling and stellar evolution. We designed new al¬ 
gorithms to improve these parts. 

We have demonstrated how NBODY6H—hGPU can sim¬ 
ulate a realistic globular cluster with one million particles, 
stellar evolution and 5% primordial binaries for several Gyr 
(one half-mass crossing time requiring about an hour com¬ 
putational time; see Figure [9]). A few more models are cur¬ 
rently in progress and we will report the detailed results of 


these simulations in future publications. With our final code 
version, which is publicly availablfl we can claim to have 
finally reached the goal of Sugimoto’s “dream” of 1990. A 
million-body cluster can be simulated for about 20 crossing 
times in one day on 320 cores with 32 GPUs. In the future, 
with the faster bandwidth and latency of hardware as well 
as optimizations of communication and data management, 
even larger system like the nuclear star clusters may be sim¬ 
ulated by direct V-body codes. 

The previous paragraphs show that our contribution 
to this would be impossible without the achievements of 
our predecessors and collaborators; in particular the cur¬ 
rent dominance of GPU hardware has been assisted by the 
development of GRAPE software over the last few decades 
which finally could be ported to GPU without fundamental 
problems. 
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Table Al. Differences between NBODY6++ and NBODY6 



Subroutine 

NBODY6++ 

NBODY6 

Installation 


Use configure script (see GPU Autoconf 
software^) 

Use Makefile 

Parallelization 


Can enable/disable features among MPI, 
GPU, OpenMP and AVX/SSE, except 
AVX/SSE requires OpenMP enabled 

Use GPU, OpenMP and AVX/SSE to¬ 
gether or OpenMP with AVX/SSE (only 
OpenMP or GPU with OpenMP are not 
supported) 

Data files 


Rename most of output data files, change 
contents of some files and describe all 

data in a manual 

Describe in a document 

Basic data ini¬ 
tialization 


Support different reading data format 
(see the manual for option KZ(22)) 

Support A-body and astronomical unit 
data format 

Primordial bi¬ 
nary initializa¬ 
tion 

binpop.[f/F] 

Support period distribution l|Kroupa| 
|199^ 

Support modified period distribution 
with maximum semi-major axis 1000 AU 
and mimimum period 1 day 

Neighbor crite¬ 
rion 

regcor_gpu.f 

gpucor.f 

Adjust neighbor number to input param¬ 
eter NNBOPT 

Adjust neighbor number based on density 
contrast 

Stellar evolu¬ 
tion 

kick.[f/F] 

hrdiag.f 

Use Maxwellian distribution of neutron 
star kick velocity with velocity dispersion 
265 km/s (one dimension: iHobbs et al.l 
|2Q(?5l) 

Use mass fall back for black hole kick (in- 

Use Maxwellian distribution of kick ve¬ 
locity with velocity dispersion 2x VSTAR 
(velocity scaling factor) and maximum 
kick velocity 10 x VSTAR 

Use black hole mass based on 


kick.[f/F] 

crease the remnant mass and reduce kick 

Eldridere & ToutI ll2004|j (mav add 


velocity: iBelczvnski. Kaloeera Sz BulikI 

Belczvnski. Kaloeera & Bulild 12002! kick 


brake4.f 

intgrt.[f/F] int- 

grt.omp.f 

mdot.[f/F] 

|2QI?2|) 

Apply mass loss only during regular step 
for thread-safety 

Only apply force correction when large 
mass loss happens (less accuracy but 
much faster) 

method in the future) 

Support gravitational radiation analyti¬ 
cal orbit shrinkage 

Apply mass loss with minimum time step 
100 years 

Calculate new force and its derivates for 
large mass loss 

Galactic tidal 

force 

xtrnlf.f fbulge.f 

Support point-mass + disk -h halo + 
Plummer model 

Support point-mass + disk -h halo H- 
bulge -h Plummer model 


^ http://www.gnu.org/software/autoconf/ 


APPENDIX A: DIFFERENCES BETWEEN NBODY6++ AND NBODY6 

There are several differences with NB0DY6 . Table [X| lists some of the most important. The manual in the NB0DY6++GPU 
code directory gives more details. 





























